Abstract

The problem of bipartite ranking, where instances are labeled positive or negative and the 
goal is to learn a scoring function that minimizes the probability of mis-ranking a pair of positive 
and negative instances (or equivalently, that maximizes the area under the ROC curve), has 
been widely studied in recent years. A dominant theoretical and algorithmic framework for the 
problem has been to reduce bipartite ranking to pairwise classification; in particular, it is well 
known that the bipartite ranking regret can be formulated as a pairwise classification regret, 
which in turn can be upper bounded using usual regret bounds for classification problems. 
Recently, Kotlowski et al. (2011) showed regret bounds for bipartite ranking in terms of the 
regret associated with balanced versions of the standard (non-pairwise) logistic and exponential 
losses. In this paper, we show that such (non-pairwise) surrogate regret bounds for bipartite 
ranking can be obtained in terms of a broad class of proper (composite) losses that we term
strongly proper. Our proof technique is much simpler than that of Kotlowski et al. (2011), and 
relies on properties of proper (composite) losses as elucidated recently by Reid and Williamson 
(2010, 2011) and others. Our result yields explicit surrogate bounds (with no hidden balancing 
terms) in terms of a variety of strongly proper losses, including for example logistic, exponential, 
squared and squared hinge losses as special cases. We also obtain tighter surrogate bounds under 
certain low-noise conditions via a recent result of Clémençon and Robbiano (2011).

1 Introduction 



Ranking problems arise in a variety of applications ranging from information retrieval to
recommendation systems and from computational biology to drug discovery, and have been widely studied
in machine learning and statistics in the last several years. Recently, there has been much interest
in understanding statistical consistency and regret behavior of algorithms for a variety of ranking
problems, including various forms of label/subset ranking as well as instance ranking problems.

In this paper, we study regret bounds for the bipartite instance ranking problem, where instances
are labeled positive or negative and the goal is to learn a scoring function that minimizes the
probability of mis-ranking a pair of positive and negative instances, or equivalently, that maximizes the
area under the ROC curve [14, 1]. A popular algorithmic and theoretical approach to bipartite
ranking has been to treat the problem as analogous to pairwise classification [T7J [TBI EH El E].
Indeed, this approach enjoys theoretical support since the bipartite ranking regret can be
formulated as a pairwise classification regret, and therefore any algorithm minimizing the latter over a
suitable class of functions will also minimize the ranking regret (this follows formally from results
of [8]; see Section 3.1 for a summary). Nevertheless, it has often been observed that algorithms
such as AdaBoost, logistic regression, and in some cases even SVMs, which minimize the
exponential, logistic, and hinge losses respectively in the standard (non-pairwise) setting, also yield good
bipartite ranking performance [11, 20, 24]. For losses such as the exponential or logistic losses, this
is not surprising since algorithms minimizing these losses (but not the hinge loss) are known to
effectively estimate conditional class probabilities [31]; since the class probability function provides
the optimal ranking [8], it is intuitively clear (and follows formally from results in [HI [9]) that any
algorithm providing a good approximation to the class probability function should also produce a
good ranking. However, there has been very little work so far on quantifying the ranking regret of
a scoring function in terms of the regret associated with such surrogate losses.

Recently, [19] showed that the bipartite ranking regret of a scoring function can be upper
bounded in terms of the regret associated with balanced versions of the standard (non-pairwise)
exponential and logistic losses. However, their proof technique builds on analyses involving the
reduction of bipartite ranking to pairwise classification, and involves analyses specific to the
exponential and logistic losses (see Section 3.2). More fundamentally, the balanced losses in their result
depend on the underlying distribution and cannot be optimized directly by an algorithm; while it
is possible to do so approximately, one then loses the quantitative nature of the bounds.

In this work we obtain quantitative regret bounds for bipartite ranking in terms of a broad 
class of proper (composite) loss functions that we term strongly proper. Our proof technique is 
considerably simpler than that of [19], and relies on properties of proper (composite) losses as 
elucidated recently for example in [22, 23, 15, 6]. Our result yields explicit surrogate bounds (with
no hidden balancing terms) in terms of a variety of strongly proper (composite) losses, including 
for example logistic, exponential, squared and squared hinge losses as special cases. We also obtain 
tighter surrogate bounds under certain low-noise conditions via a recent result of [9]. 

The paper is organized as follows. In Section 2 we formally set up the bipartite instance
ranking problem and definitions related to loss functions and regret, and provide background on
proper (composite) losses. Section 3 summarizes related work that provides the background for our
study, namely the reduction of bipartite ranking to pairwise binary classification and the result of
[19]. In Section 4 we define and characterize strongly proper losses. Section 5 contains our main
result, namely a bound on the bipartite ranking regret in terms of the regret associated with any
strongly proper loss, together with several examples. Section 6 gives a tighter bound under certain
low-noise conditions via a recent result of [9]. We conclude with a brief discussion and some open
questions in Section 7.



2 Formal Setup, Preliminaries, and Background 

This section provides background on the bipartite ranking problem, binary loss functions and regret, 
and proper (composite) losses. 
2.1 Bipartite Ranking



As in binary classification, in bipartite ranking there is an instance space $\mathcal{X}$ and binary labels
$\mathcal{Y} = \{\pm 1\}$, with an unknown distribution $D$ on $\mathcal{X} \times \{\pm 1\}$. For $(X, Y) \sim D$ and $x \in \mathcal{X}$, we denote
$\eta(x) = P(Y = 1 \mid X = x)$ and $p = P(Y = 1)$. Given i.i.d. examples $(X_1, Y_1), \ldots, (X_n, Y_n) \sim D$,
the goal is to learn a scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$ (where $\bar{\mathbb{R}} = [-\infty, \infty]$) that assigns higher scores
to positive instances than to negative ones.^1 Specifically, the goal is to learn a scoring function $f$
with low ranking error (or ranking risk), defined as^2

$$\mathrm{er}^{\mathrm{rank}}_D[f] = \mathbf{E}\Big[ \mathbf{1}\big((Y - Y')(f(X) - f(X')) < 0\big) + \tfrac{1}{2}\, \mathbf{1}\big(f(X) = f(X')\big) \;\Big|\; Y \neq Y' \Big], \tag{1}$$

where $(X, Y), (X', Y')$ are assumed to be drawn i.i.d. from $D$, and $\mathbf{1}(\cdot)$ is 1 if its argument is true
and 0 otherwise; thus the ranking error of $f$ is simply the probability that a randomly drawn
positive instance receives a lower score under $f$ than a randomly drawn negative instance, with ties
broken uniformly at random. The optimal ranking error (or Bayes ranking error or Bayes ranking
risk) can be seen to be

$$\mathrm{er}^{\mathrm{rank},*}_D = \inf_{f : \mathcal{X} \to \bar{\mathbb{R}}} \mathrm{er}^{\mathrm{rank}}_D[f] \tag{2}$$

$$\phantom{\mathrm{er}^{\mathrm{rank},*}_D} = \frac{1}{2p(1-p)}\, \mathbf{E}_{X,X'}\Big[ \min\big( \eta(X)(1 - \eta(X')),\; \eta(X')(1 - \eta(X)) \big) \Big]. \tag{3}$$

The ranking regret of a scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$ is then simply

$$\mathrm{regret}^{\mathrm{rank}}_D[f] = \mathrm{er}^{\mathrm{rank}}_D[f] - \mathrm{er}^{\mathrm{rank},*}_D. \tag{4}$$
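As an illustration of Eq. (1), the following minimal sketch computes the empirical analogue of the ranking error on a finite labeled sample, i.e. the fraction of (positive, negative) pairs that are mis-ranked, with ties counted as one half. The function name `empirical_ranking_error` and the toy data are assumptions made for this example; the quantity returned equals one minus the empirical area under the ROC curve.

```python
from itertools import product


def empirical_ranking_error(scores, labels):
    """Fraction of (positive, negative) pairs that are mis-ranked; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    total = 0.0
    for s_pos, s_neg in product(pos, neg):
        if s_pos < s_neg:       # positive scored below negative: a mis-ranking
            total += 1.0
        elif s_pos == s_neg:    # tie: broken uniformly at random
            total += 0.5
    return total / (len(pos) * len(neg))


# Toy sample (hypothetical): one tie among the four positive-negative pairs.
toy_error = empirical_ranking_error([0.9, 0.4, 0.4, 0.1], [1, 1, -1, -1])
```

The double loop is quadratic in the sample size; it is written this way only to mirror the pairwise definition.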

We will be interested in upper bounding the ranking regret of a scoring function / in terms of its 
regret with respect to certain other (binary) loss functions. In particular, the loss functions we 
consider will belong to the class of proper (composite) loss functions. Below we briefly review some 
standard notions related to loss functions and regret, and then discuss some properties of proper 
(composite) losses. 



2.2 Loss Functions, Regret, and Conditional Risks and Regret 

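Assume again a probability distribution $D$ on $\mathcal{X} \times \{\pm 1\}$ as above. Given a prediction space $\widehat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$,
a binary loss function $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ (where $\bar{\mathbb{R}}_+ = [0, \infty]$) assigns a penalty $\ell(y, \hat{y})$ for predicting
$\hat{y} \in \widehat{\mathcal{Y}}$ when the true label is $y \in \{\pm 1\}$.^3 For any such loss $\ell$, the $\ell$-error (or $\ell$-risk) of a function
$f : \mathcal{X} \to \widehat{\mathcal{Y}}$ is defined as

$$\mathrm{er}^{\ell}_D[f] = \mathbf{E}_{(X,Y) \sim D}\big[ \ell(Y, f(X)) \big], \tag{5}$$

and the optimal $\ell$-error (or optimal $\ell$-risk or Bayes $\ell$-risk) is defined as

$$\mathrm{er}^{\ell,*}_D = \inf_{f : \mathcal{X} \to \widehat{\mathcal{Y}}} \mathrm{er}^{\ell}_D[f]. \tag{6}$$

1 Most algorithms learn real-valued functions; we also allow values $-\infty$ and $\infty$ for technical reasons.
2 We assume measurability conditions where necessary.
3 Most loss functions take values in $\mathbb{R}_+$, but some loss functions (such as the logistic loss, described later) can
assign a loss of $\infty$ to certain label-prediction pairs.

The $\ell$-regret of a function $f : \mathcal{X} \to \widehat{\mathcal{Y}}$ is the difference of its $\ell$-error from the optimal $\ell$-error:

$$\mathrm{regret}^{\ell}_D[f] = \mathrm{er}^{\ell}_D[f] - \mathrm{er}^{\ell,*}_D. \tag{7}$$

The conditional $\ell$-risk $L_\ell : [0, 1] \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ is defined as^4

$$L_\ell(\eta, \hat{y}) = \mathbf{E}_{Y \sim \eta}\big[ \ell(Y, \hat{y}) \big] = \eta\, \ell(1, \hat{y}) + (1 - \eta)\, \ell(-1, \hat{y}), \tag{8}$$

where $Y \sim \eta$ denotes a $\{\pm 1\}$-valued random variable taking value $+1$ with probability $\eta$. The
conditional Bayes $\ell$-risk $H_\ell : [0, 1] \to \bar{\mathbb{R}}_+$ is defined as

$$H_\ell(\eta) = \inf_{\hat{y} \in \widehat{\mathcal{Y}}} L_\ell(\eta, \hat{y}). \tag{9}$$

The conditional $\ell$-regret $R_\ell : [0, 1] \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ is then simply

$$R_\ell(\eta, \hat{y}) = L_\ell(\eta, \hat{y}) - H_\ell(\eta). \tag{10}$$

Clearly, we have for $f : \mathcal{X} \to \widehat{\mathcal{Y}}$,

$$\mathrm{er}^{\ell}_D[f] = \mathbf{E}_X\big[ L_\ell(\eta(X), f(X)) \big], \tag{11}$$

and

$$\mathrm{er}^{\ell,*}_D = \mathbf{E}_X\big[ H_\ell(\eta(X)) \big]. \tag{12}$$

We note the following:

Lemma 1. For any $\widehat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$ and binary loss $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$, the conditional Bayes $\ell$-risk $H_\ell$ is
a concave function on $[0, 1]$.

The proof follows simply by observing that $H_\ell$ is defined as the pointwise infimum of a family
of linear (and therefore concave) functions, and therefore is itself concave.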
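To make the quantities in Eqs. (8)-(10) concrete, here is a small numerical sketch for the logistic loss. It approximates the infimum in Eq. (9) by a minimum over a finite grid of predictions, which is an assumption of this example (the true infimum is over all of the prediction space); the helper names and the grid are likewise illustrative.

```python
import math


def logistic_loss(y, yhat):
    """Logistic loss ln(1 + exp(-y * yhat)) for a label y in {+1, -1}."""
    return math.log1p(math.exp(-y * yhat))


def conditional_risk(loss, eta, yhat):
    """L(eta, yhat) = eta * loss(1, yhat) + (1 - eta) * loss(-1, yhat), as in Eq. (8)."""
    return eta * loss(1, yhat) + (1.0 - eta) * loss(-1, yhat)


def bayes_risk_on_grid(loss, eta, grid):
    """Approximate H(eta) of Eq. (9) by minimizing over a finite grid of predictions."""
    return min(conditional_risk(loss, eta, yhat) for yhat in grid)


def conditional_regret_on_grid(loss, eta, yhat, grid):
    """Approximate R(eta, yhat) = L(eta, yhat) - H(eta), as in Eq. (10)."""
    return conditional_risk(loss, eta, yhat) - bayes_risk_on_grid(loss, eta, grid)


# Hypothetical grid of predictions covering [-10, 10] in steps of 0.01.
GRID = [i / 100.0 for i in range(-1000, 1001)]
```

For the logistic loss the grid minimum should agree closely with the binary entropy of $\eta$, and evaluating `bayes_risk_on_grid` at a few values of $\eta$ gives a quick sanity check of the concavity asserted in Lemma 1.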

2.3 Proper and Proper Composite Losses 

In this section we review some background material related to proper and proper composite losses,
as studied recently in [22, 23, 15, 6]. While the material is meant to be mostly a review, some of
the exposition is simplified compared to previous presentations, and we include a new, simple proof
of an important fact (Theorem 4).

Proper Losses. We start by considering binary class probability estimation (CPE) loss functions
that operate on the prediction space $\widehat{\mathcal{Y}} = [0, 1]$. A binary CPE loss function $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$
is said to be proper if for all $\eta \in [0, 1]$,

$$\eta \in \arg\min_{\hat{\eta} \in [0,1]} L_c(\eta, \hat{\eta}), \tag{13}$$

and strictly proper if the minimizer is unique for all $\eta \in [0, 1]$. Equivalently, $c$ is proper if for all
$\eta \in [0, 1]$, $H_c(\eta) = L_c(\eta, \eta)$, and strictly proper if $H_c(\eta) < L_c(\eta, \hat{\eta})$ for all $\hat{\eta} \neq \eta$. We have the
following basic result:

Lemma 2 ([15, 26]). Let $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ be a binary CPE loss. If $c$ is proper, then $c(1, \cdot)$
is a decreasing function on $[0, 1]$ and $c(-1, \cdot)$ is an increasing function. If $c$ is strictly proper, then
$c(1, \cdot)$ is strictly decreasing on $[0, 1]$ and $c(-1, \cdot)$ is strictly increasing.

4 Note that we overload notation by using $\eta$ here to refer to a number in $[0, 1]$; the usage should be clear from
context.

We will find it useful to consider regular proper losses. As in [15], we say a binary CPE loss
$c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ is regular if $c(1, \hat{\eta}) \in \mathbb{R}_+ \;\forall \hat{\eta} \in (0, 1]$ and $c(-1, \hat{\eta}) \in \mathbb{R}_+ \;\forall \hat{\eta} \in [0, 1)$, i.e. if
$c(y, \hat{\eta})$ is finite for all $y, \hat{\eta}$ except possibly for $c(1, 0)$ and $c(-1, 1)$, which are allowed to be infinite.
The following characterization of regular proper losses is well known (see also [15]):

Theorem 3 ([25]). A regular binary CPE loss $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ is proper if and only if for all
$\eta, \hat{\eta} \in [0, 1]$ there exists a superderivative $H'_c(\hat{\eta})$ of $H_c$ at $\hat{\eta}$ such that^5

$$L_c(\eta, \hat{\eta}) = H_c(\hat{\eta}) + (\eta - \hat{\eta}) \cdot H'_c(\hat{\eta}).$$

The following is a characterization of strict properness of a proper loss $c$ in terms of its
conditional Bayes risk $H_c$:

Theorem 4. A proper loss $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ is strictly proper if and only if $H_c$ is strictly
concave.

This result can be proved in several ways. A proof in [15] is attributed to an argument in [16].
If $H_c$ is twice differentiable, an alternative proof follows from a result in [6, 26], which shows that a
proper loss $c$ is strictly proper if and only if its 'weight function' $w_c = -H''_c$ satisfies $w_c(\eta) > 0$ for
all except at most countably many points $\eta \in [0, 1]$; by a very recent result of [27], this condition is
equivalent to strict convexity of the function $-H_c$, or equivalently, strict concavity of $H_c$. Here we
give a third, self-contained proof of the above result that is derived from first principles, and that
will be helpful when we study strongly proper losses in Section 4.

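Proof of Theorem 4. Let $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ be a proper loss. For the 'if' direction, assume $H_c$
is strictly concave. Let $\eta, \hat{\eta} \in [0, 1]$ such that $\hat{\eta} \neq \eta$. Then we have

$$\begin{aligned}
L_c(\eta, \hat{\eta}) - H_c(\eta)
  &= L_c(\eta, \hat{\eta}) + H_c(\hat{\eta}) - H_c(\hat{\eta}) - H_c(\eta) \\
  &= L_c(\eta, \hat{\eta}) + H_c(\hat{\eta}) - 2 \Big( \tfrac{1}{2} H_c(\eta) + \tfrac{1}{2} H_c(\hat{\eta}) \Big) \\
  &> L_c(\eta, \hat{\eta}) + H_c(\hat{\eta}) - 2 H_c\Big( \tfrac{\eta + \hat{\eta}}{2} \Big) \\
  &= 2 \Big( \tfrac{1}{2} L_c(\eta, \hat{\eta}) + \tfrac{1}{2} L_c(\hat{\eta}, \hat{\eta}) \Big) - 2 H_c\Big( \tfrac{\eta + \hat{\eta}}{2} \Big) \\
  &= 2 \Big( L_c\Big( \tfrac{\eta + \hat{\eta}}{2}, \hat{\eta} \Big) - H_c\Big( \tfrac{\eta + \hat{\eta}}{2} \Big) \Big) \\
  &\geq 0.
\end{aligned}$$

Here the strict inequality uses strict concavity of $H_c$, the next equality uses $H_c(\hat{\eta}) = L_c(\hat{\eta}, \hat{\eta})$ (since $c$
is proper), and the one after uses linearity of $L_c$ in its first argument. Thus $c$ is strictly proper.

Conversely, to prove the 'only if' direction, assume $c$ is strictly proper. Let $\eta_1, \eta_2 \in [0, 1]$ such
that $\eta_1 \neq \eta_2$, and let $t \in (0, 1)$. Then we have

$$\begin{aligned}
H_c(t \eta_1 + (1 - t) \eta_2)
  &= L_c(t \eta_1 + (1 - t) \eta_2,\; t \eta_1 + (1 - t) \eta_2) \\
  &= t\, L_c(\eta_1,\; t \eta_1 + (1 - t) \eta_2) + (1 - t)\, L_c(\eta_2,\; t \eta_1 + (1 - t) \eta_2) \\
  &> t\, H_c(\eta_1) + (1 - t)\, H_c(\eta_2).
\end{aligned}$$

Thus $H_c$ is strictly concave. □

5 Here $u \in \mathbb{R}$ is a superderivative of $H_c$ at $\hat{\eta}$ if for all $\eta \in [0, 1]$, $H_c(\hat{\eta}) - H_c(\eta) \geq u (\hat{\eta} - \eta)$.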



Proper Composite Losses. The notion of properness can be extended to binary loss functions
operating on prediction spaces $\widehat{\mathcal{Y}}$ other than $[0, 1]$ via composition with a link function $\psi : [0, 1] \to \widehat{\mathcal{Y}}$.
Specifically, for any $\widehat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$, a loss function $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ is said to be proper composite if it
can be written as

$$\ell(y, \hat{y}) = c(y, \psi^{-1}(\hat{y})) \tag{14}$$

for some proper loss $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ and strictly increasing (and therefore invertible) link
function $\psi : [0, 1] \to \widehat{\mathcal{Y}}$. Proper composite losses have been studied recently in [22, 23, 6], and include
several widely used losses such as squared, squared hinge, logistic, and exponential losses.

It is worth noting that for a proper composite loss $\ell$ formed from a proper loss $c$, $H_\ell = H_c$.
Moreover, any property associated with the underlying proper loss $c$ can also be used to describe
the composite loss $\ell$; thus we will refer to a proper composite loss $\ell$ formed from a regular proper
loss $c$ as regular proper composite, a composite loss formed from a strictly proper loss as strictly
proper composite, etc. In Section 4 we will define and characterize strongly proper (composite)
losses, which we will use to obtain regret bounds for bipartite ranking.



3 Related Work

As noted above, a popular theoretical and algorithmic framework for bipartite ranking has been to
reduce the problem to pairwise classification. Below we describe this reduction in the context of
our setting and notation, and then review the result of [19] which builds on this pairwise reduction.

3.1 Reduction of Bipartite Ranking to Pairwise Binary Classification

For any distribution $D$ on $\mathcal{X} \times \{\pm 1\}$, consider the distribution $\tilde{D}$ on $(\mathcal{X} \times \mathcal{X}) \times \{\pm 1\}$ defined as
follows:

1. Sample $(X, Y)$ and $(X', Y')$ i.i.d. from $D$;

2. If $Y = Y'$, then go to step 1; else set^6

$$\tilde{X} = (X, X'), \qquad \tilde{Y} = \mathrm{sign}(Y - Y')$$

and return $(\tilde{X}, \tilde{Y})$.
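The two-step sampling procedure above can be sketched in a few lines. This is only an illustration: the toy instance space, the table `ETA` of class probabilities, and the helper `sample_example` are assumptions introduced here and are not part of the paper.

```python
import random

# Hypothetical toy distribution D: X uniform on {0, 1, 2}, eta(x) as tabulated.
ETA = {0: 0.1, 1: 0.5, 2: 0.9}


def sample_example(rng):
    """Draw one labeled example (x, y) from the toy distribution D."""
    x = rng.choice(sorted(ETA))
    y = 1 if rng.random() < ETA[x] else -1
    return x, y


def sample_pair(sampler, rng):
    """Draw one pairwise example: resample until the two labels differ."""
    while True:
        x, y = sampler(rng)          # step 1: two i.i.d. draws from D
        x2, y2 = sampler(rng)
        if y != y2:                  # step 2: keep only pairs with Y != Y'
            return (x, x2), (1 if y - y2 > 0 else -1)


rng = random.Random(0)
pairs = [sample_pair(sample_example, rng) for _ in range(5000)]
```

By symmetry of the construction, roughly half of the sampled pairs should carry the label $+1$, and pairs such as $(2, 0)$ should be labeled $+1$ almost always under this toy choice of `ETA`.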



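Then it is easy to see that, under $\tilde{D}$,

$$P\big(\tilde{X} = (x, x')\big) = \frac{P(X = x)\, P(X' = x')\, \big( \eta(x)(1 - \eta(x')) + \eta(x')(1 - \eta(x)) \big)}{2p(1-p)} \tag{15}$$

$$\tilde{\eta}\big((x, x')\big) = P\big(\tilde{Y} = 1 \mid \tilde{X} = (x, x')\big) = \frac{\eta(x)(1 - \eta(x'))}{\eta(x)(1 - \eta(x')) + \eta(x')(1 - \eta(x))} \tag{16}$$

$$\tilde{p} = P(\tilde{Y} = 1) = \frac{1}{2}. \tag{17}$$

6 Throughout the paper, $\mathrm{sign}(u) = +1$ if $u > 0$ and $-1$ otherwise.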
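Moreover, for the 0-1 loss $\ell_{0\text{-}1} : \{\pm 1\} \times \{\pm 1\} \to \{0, 1\}$ given by $\ell_{0\text{-}1}(y, \hat{y}) = \mathbf{1}(\hat{y} \neq y)$, we have the
following for any pairwise binary classifier $h : \mathcal{X} \times \mathcal{X} \to \{\pm 1\}$:

$$\mathrm{er}^{0\text{-}1}_{\tilde{D}}[h] = \mathbf{E}_{(\tilde{X}, \tilde{Y}) \sim \tilde{D}}\big[ \mathbf{1}\big(h(\tilde{X}) \neq \tilde{Y}\big) \big] \tag{18}$$

$$\mathrm{er}^{0\text{-}1,*}_{\tilde{D}} = \mathbf{E}_{\tilde{X}}\big[ \min\big( \tilde{\eta}(\tilde{X}),\; 1 - \tilde{\eta}(\tilde{X}) \big) \big] \tag{19}$$

$$\mathrm{regret}^{0\text{-}1}_{\tilde{D}}[h] = \mathrm{er}^{0\text{-}1}_{\tilde{D}}[h] - \mathrm{er}^{0\text{-}1,*}_{\tilde{D}}. \tag{20}$$

Now for any scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$, define $f_{\mathrm{diff}} : \mathcal{X} \times \mathcal{X} \to \bar{\mathbb{R}}$ as

$$f_{\mathrm{diff}}(x, x') = f(x) - f(x'). \tag{21}$$

Then it is easy to see that:

$$\mathrm{er}^{\mathrm{rank}}_D[f] = \mathrm{er}^{0\text{-}1}_{\tilde{D}}[\mathrm{sign} \circ f_{\mathrm{diff}}] \tag{22}$$

$$\mathrm{er}^{\mathrm{rank},*}_D = \mathrm{er}^{0\text{-}1,*}_{\tilde{D}}, \tag{23}$$

where $(g \circ f)(u) = g(f(u))$. The equality in Eq. (23) follows from the fact that the classifier
$h^*(x, x') = \mathrm{sign}(\eta(x) - \eta(x'))$ achieves the Bayes 0-1 risk, i.e. $\mathrm{er}^{0\text{-}1}_{\tilde{D}}[h^*] = \mathrm{er}^{0\text{-}1,*}_{\tilde{D}}$ [8]. Thus

$$\mathrm{regret}^{\mathrm{rank}}_D[f] = \mathrm{regret}^{0\text{-}1}_{\tilde{D}}[\mathrm{sign} \circ f_{\mathrm{diff}}], \tag{24}$$

and therefore the ranking regret of a scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$ can be analyzed via upper bounds
on the 0-1 regret of the pairwise classifier $(\mathrm{sign} \circ f_{\mathrm{diff}}) : \mathcal{X} \times \mathcal{X} \to \{\pm 1\}$.^7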

In particular, as noted in [8], applying a result of [4], we can upper bound the pairwise 0-1
regret above in terms of the pairwise $\ell_\phi$-regret associated with any classification-calibrated margin
loss $\ell_\phi : \{\pm 1\} \times \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$, i.e. any loss of the form $\ell_\phi(y, \hat{y}) = \phi(y \hat{y})$ for some function $\phi : \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$
satisfying $\forall \eta \in [0, 1]$, $\eta \neq \tfrac{1}{2}$,^8

$$\hat{y}^* \in \arg\min_{\hat{y} \in \bar{\mathbb{R}}} L_\phi(\eta, \hat{y}) \;\Longrightarrow\; \hat{y}^* \big(\eta - \tfrac{1}{2}\big) > 0. \tag{25}$$

We note in particular that for every proper composite margin loss, the associated link function $\psi$
satisfies $\psi(\tfrac{1}{2}) = 0$ [22], and therefore every strictly proper composite margin loss is
classification-calibrated in the sense above.^9

Theorem 5 ([4]; see also [8]). Let $\phi : \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$ be such that the margin loss $\ell_\phi : \{\pm 1\} \times \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$
defined as $\ell_\phi(y, \hat{y}) = \phi(y \hat{y})$ is classification-calibrated as above. Then $\exists$ strictly increasing function
$g_\phi : \bar{\mathbb{R}}_+ \to [0, 1]$ with $g_\phi(0) = 0$ such that for any $\tilde{f} : \mathcal{X} \times \mathcal{X} \to \bar{\mathbb{R}}$,

$$\mathrm{regret}^{0\text{-}1}_{\tilde{D}}[\mathrm{sign} \circ \tilde{f}] \leq g_\phi\big( \mathrm{regret}^{\phi}_{\tilde{D}}[\tilde{f}] \big).$$

7 Note that the setting here is somewhat different from that of [3] and [2], who consider a subset version of bipartite
ranking where each instance consists of some finite subset of objects to be ranked; there also the problem is reduced
to a (subset) pairwise classification problem, and it is shown that given any (subset) pairwise classifier $h$, a subset
ranking function $f$ can be constructed such that the resulting subset ranking regret is at most twice the subset
pairwise classification regret of $h$ [3], or in expectation at most equal to the pairwise classification regret of $h$ [2].

8 We abbreviate $L_\phi = L_{\ell_\phi}$, $\mathrm{er}^{\phi}_D = \mathrm{er}^{\ell_\phi}_D$, etc.

9 We note that in general, every strictly proper (composite) loss is classification-calibrated with respect to any
cost-sensitive zero-one loss, using a more general definition of classification calibration with an appropriate threshold
(e.g. see [22]).

[4] give a construction for $g_\phi$; in particular, for the exponential loss given by $\phi_{\exp}(u) = e^{-u}$ and
logistic loss given by $\phi_{\log}(u) = \ln(1 + e^{-u})$, both of which are strictly proper composite losses (see
Section 5.2) and are therefore classification-calibrated, one has

$$g_{\exp}(z) \leq \sqrt{2z} \tag{26}$$

$$g_{\log}(z) \leq \sqrt{2z}. \tag{27}$$

As we describe below, [19] build on these observations to bound the ranking regret in terms of the
regret associated with balanced versions of the exponential and logistic losses.



3.2 Result of Kotlowski et al. (2011) 

For any binary loss $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$, consider defining a balanced loss $\ell_{\mathrm{bal}} : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ as

$$\ell_{\mathrm{bal}}(y, \hat{y}) = \frac{1}{2p}\, \ell(1, \hat{y}) \cdot \mathbf{1}(y = 1) + \frac{1}{2(1-p)}\, \ell(-1, \hat{y}) \cdot \mathbf{1}(y = -1). \tag{28}$$

Note that such a balanced loss depends on the underlying distribution $D$ via $p = P(Y = 1)$. Then
[19] show the following, via analyses specific to the exponential and logistic losses:

Theorem 6 ([19]). For any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$$\mathrm{regret}^{\exp}_{\tilde{D}}[f_{\mathrm{diff}}] \leq \tfrac{9}{4}\, \mathrm{regret}^{\exp,\mathrm{bal}}_D[f]$$

$$\mathrm{regret}^{\log}_{\tilde{D}}[f_{\mathrm{diff}}] \leq 2\, \mathrm{regret}^{\log,\mathrm{bal}}_D[f].$$

Combining this with the results of Eq. (24), Theorem 5 and Eqs. (26-27) then gives the following
bounds on the ranking regret of any scoring function $f : \mathcal{X} \to \bar{\mathbb{R}}$ in terms of the (non-pairwise)
balanced exponential and logistic regrets of $f$:

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq \frac{3}{\sqrt{2}} \sqrt{\mathrm{regret}^{\exp,\mathrm{bal}}_D[f]} \tag{29}$$

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq 2 \sqrt{\mathrm{regret}^{\log,\mathrm{bal}}_D[f]}. \tag{30}$$

This suggests that an algorithm that produces a function $f : \mathcal{X} \to \bar{\mathbb{R}}$ with low balanced exponential
or logistic regret will also have low ranking regret. Unfortunately, since the balanced losses depend
on the unknown distribution $D$, they cannot be optimized by an algorithm directly.^10 [19] provide
some justification for why in certain situations, minimizing the usual exponential or logistic loss may
also minimize the balanced versions of these losses; however, by doing so, one loses the quantitative
nature of the above bounds. Below we obtain upper bounds on the ranking regret of a function
$f$ directly in terms of its loss-based regret (with no balancing terms) for a wide range of proper
(composite) loss functions that we term strongly proper, including the exponential and logistic losses
as special cases.



10 We note it is possible to optimize approximately balanced losses, e.g. by estimating p from the data. 
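The approximate approach mentioned in footnote 10 can be sketched as follows: estimate $p$ by the fraction of positive labels in the sample and plug it into the balanced loss of Eq. (28). The helper names below are hypothetical, and the plug-in estimate is only one possible choice; as the text notes, bounds stated for the exact balanced loss do not automatically carry over to this approximation.

```python
import math


def balanced_loss(loss, p):
    """Return the balanced version of `loss` from Eq. (28) for class prior p = P(Y = 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    def loss_bal(y, yhat):
        # Positives are weighted by 1/(2p), negatives by 1/(2(1 - p)).
        weight = 1.0 / (2.0 * p) if y == 1 else 1.0 / (2.0 * (1.0 - p))
        return weight * loss(y, yhat)

    return loss_bal


def estimate_p(labels):
    """Plug-in estimate of p: the fraction of positive labels in the sample."""
    return sum(1 for y in labels if y == 1) / len(labels)


def exp_loss(y, yhat):
    """Exponential loss exp(-y * yhat)."""
    return math.exp(-y * yhat)


# Hypothetical sample with one positive among four examples.
p_hat = estimate_p([1, -1, -1, -1])
exp_bal = balanced_loss(exp_loss, p_hat)
```

When $p = 1/2$ both weights equal one and the balanced loss coincides with the original loss.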
4 Strongly Proper Losses

We define strongly proper losses as follows: 

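Definition 7. Let $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ be a binary CPE loss and let $\lambda > 0$. We say $c$ is $\lambda$-strongly
proper if for all $\eta, \hat{\eta} \in [0, 1]$,

$$L_c(\eta, \hat{\eta}) - H_c(\eta) \geq \frac{\lambda}{2} (\eta - \hat{\eta})^2.$$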
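The inequality in Definition 7 can be checked numerically on a grid. The sketch below does so for the logarithmic CPE loss, for which the left-hand side is the binary KL divergence; it is a sanity check under the stated grid, not a proof, and the helper names are illustrative. It uses $H_c(\eta) = L_c(\eta, \eta)$, which holds for any proper loss.

```python
import math


def log_cpe_loss(y, eta_hat):
    """Logarithmic CPE loss: -ln(eta_hat) for y = +1 and -ln(1 - eta_hat) for y = -1."""
    return -math.log(eta_hat) if y == 1 else -math.log(1.0 - eta_hat)


def cond_risk(c, eta, eta_hat):
    """L_c(eta, eta_hat) = eta * c(1, eta_hat) + (1 - eta) * c(-1, eta_hat)."""
    return eta * c(1, eta_hat) + (1.0 - eta) * c(-1, eta_hat)


def strong_properness_gap(c, lam, grid):
    """Smallest value on the grid of
    L_c(eta, eta_hat) - L_c(eta, eta) - (lam / 2) * (eta - eta_hat)^2.

    A non-negative result is consistent with lam-strong properness on the grid.
    """
    return min(
        cond_risk(c, eta, eh) - cond_risk(c, eta, eta) - 0.5 * lam * (eta - eh) ** 2
        for eta in grid
        for eh in grid
    )


# Interior grid points 0.02, 0.04, ..., 0.98 (endpoints excluded to keep the loss finite).
GRID = [i / 50.0 for i in range(1, 50)]
```

On this grid the gap is non-negative for $\lambda = 4$ and becomes negative for $\lambda = 5$ (for instance near $\eta = 0.5$, $\hat{\eta} = 0.52$), in line with the constant derived for the logistic loss in Example 2.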

We have the following necessary and sufficient conditions for strong properness: 

Lemma 8. Let $\lambda > 0$. If $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ is $\lambda$-strongly proper, then $H_c$ is $\lambda$-strongly concave.

Proof. The proof is similar to the 'only if' direction in the proof of Theorem 4. Let $c$ be $\lambda$-strongly
proper. Let $\eta_1, \eta_2 \in [0, 1]$ such that $\eta_1 \neq \eta_2$, and let $t \in (0, 1)$. Then we have

$$\begin{aligned}
H_c(t \eta_1 + (1 - t) \eta_2)
  &= L_c(t \eta_1 + (1 - t) \eta_2,\; t \eta_1 + (1 - t) \eta_2) \\
  &= t\, L_c(\eta_1,\; t \eta_1 + (1 - t) \eta_2) + (1 - t)\, L_c(\eta_2,\; t \eta_1 + (1 - t) \eta_2) \\
  &\geq t \Big( H_c(\eta_1) + \frac{\lambda}{2} (1 - t)^2 (\eta_1 - \eta_2)^2 \Big) + (1 - t) \Big( H_c(\eta_2) + \frac{\lambda}{2} t^2 (\eta_1 - \eta_2)^2 \Big) \\
  &= t\, H_c(\eta_1) + (1 - t)\, H_c(\eta_2) + \frac{\lambda}{2}\, t (1 - t) (\eta_1 - \eta_2)^2.
\end{aligned}$$

Thus $H_c$ is $\lambda$-strongly concave. □

Lemma 9. Let $\lambda > 0$ and let $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ be a regular proper loss. If $H_c$ is $\lambda$-strongly
concave, then $c$ is $\lambda$-strongly proper.

Proof. Let $\eta, \hat{\eta} \in [0, 1]$. By Theorem 3, there exists a superderivative $H'_c(\hat{\eta})$ of $H_c$ at $\hat{\eta}$ such that

$$L_c(\eta, \hat{\eta}) = H_c(\hat{\eta}) + (\eta - \hat{\eta}) \cdot H'_c(\hat{\eta}).$$

This gives

$$\begin{aligned}
L_c(\eta, \hat{\eta}) - H_c(\eta)
  &= H_c(\hat{\eta}) - H_c(\eta) + (\eta - \hat{\eta}) \cdot H'_c(\hat{\eta}) \\
  &\geq \frac{\lambda}{2} (\hat{\eta} - \eta)^2, \quad \text{since } H_c \text{ is } \lambda\text{-strongly concave.}
\end{aligned}$$

Thus $c$ is $\lambda$-strongly proper. □

This gives us the following characterization of strong properness for regular proper losses:

Theorem 10. Let $\lambda > 0$ and let $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ be a regular proper loss. Then $c$ is $\lambda$-strongly
proper if and only if $H_c$ is $\lambda$-strongly concave.

Several examples of strongly proper (composite) losses will be provided in Section 5.2 and
Section 5.3. Theorem 10 will form our main tool in establishing strong properness of many of these
loss functions.
5 Regret Bounds via Strongly Proper Losses

We start by recalling the following result of [8] (adapted to account for ties, and for the conditioning
on $Y \neq Y'$):

Theorem 11 ([8]). For any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$$\mathrm{regret}^{\mathrm{rank}}_D[f] = \frac{1}{2p(1-p)}\, \mathbf{E}_{X,X'}\Big[ \big| \eta(X) - \eta(X') \big| \cdot \Big( \mathbf{1}\big( (f(X) - f(X'))(\eta(X) - \eta(X')) < 0 \big) + \tfrac{1}{2}\, \mathbf{1}\big( f(X) = f(X') \big) \Big) \Big].$$

As noted by [9], this leads to the following corollary on the regret of any plug-in ranking function
based on an estimate $\hat{\eta}$:

Corollary 12. For any $\hat{\eta} : \mathcal{X} \to [0, 1]$,

$$\mathrm{regret}^{\mathrm{rank}}_D[\hat{\eta}] \leq \frac{1}{p(1-p)}\, \mathbf{E}_X\big[ \big| \hat{\eta}(X) - \eta(X) \big| \big].$$

For completeness, a proof is given in Appendix A. We are now ready to prove our main result.

5.1 Main Result 

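Theorem 13. Let $\widehat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$ and let $\lambda > 0$. Let $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ be a $\lambda$-strongly proper composite
loss. Then for any $f : \mathcal{X} \to \widehat{\mathcal{Y}}$,

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq \frac{\sqrt{2}}{p(1-p)\sqrt{\lambda}} \sqrt{\mathrm{regret}^{\ell}_D[f]}.$$

Proof. Let $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ be a $\lambda$-strongly proper loss and $\psi : [0, 1] \to \widehat{\mathcal{Y}}$ be a (strictly
increasing) link function such that $\ell(y, \hat{y}) = c(y, \psi^{-1}(\hat{y}))$ for all $y \in \{\pm 1\}$, $\hat{y} \in \widehat{\mathcal{Y}}$. Let $f : \mathcal{X} \to \widehat{\mathcal{Y}}$. Then we
have,

$$\begin{aligned}
\mathrm{regret}^{\mathrm{rank}}_D[f]
  &= \mathrm{regret}^{\mathrm{rank}}_D[\psi^{-1} \circ f], \quad \text{since } \psi \text{ is strictly increasing} \\
  &\leq \frac{1}{p(1-p)}\, \mathbf{E}_X\big[ \big| \psi^{-1}(f(X)) - \eta(X) \big| \big], \quad \text{by Corollary 12} \\
  &= \frac{1}{p(1-p)} \sqrt{ \Big( \mathbf{E}_X\big[ \big| \psi^{-1}(f(X)) - \eta(X) \big| \big] \Big)^2 } \\
  &\leq \frac{1}{p(1-p)} \sqrt{ \mathbf{E}_X\big[ \big( \psi^{-1}(f(X)) - \eta(X) \big)^2 \big] }, \quad \text{by convexity of } \varphi(u) = u^2 \text{ and Jensen's inequality} \\
  &\leq \frac{1}{p(1-p)} \sqrt{ \frac{2}{\lambda}\, \mathbf{E}_X\big[ R_c\big( \eta(X), \psi^{-1}(f(X)) \big) \big] }, \quad \text{since } c \text{ is } \lambda\text{-strongly proper} \\
  &= \frac{1}{p(1-p)} \sqrt{ \frac{2}{\lambda}\, \mathbf{E}_X\big[ R_\ell\big( \eta(X), f(X) \big) \big] } \\
  &= \frac{\sqrt{2}}{p(1-p)\sqrt{\lambda}} \sqrt{\mathrm{regret}^{\ell}_D[f]}.
\end{aligned}$$

□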
Theorem 13 shows that for any strongly proper composite loss $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$, a function
$f : \mathcal{X} \to \widehat{\mathcal{Y}}$ with low $\ell$-regret will also have low ranking regret. Below we give several examples of
such strongly proper (composite) loss functions; properties of some of these losses are summarized
in Table 1.
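For a sense of scale, the right-hand side of Theorem 13 can be evaluated directly. The helper below is a hypothetical convenience function, not part of the paper; it simply computes $\sqrt{2}/(p(1-p)\sqrt{\lambda})$ times the square root of a given surrogate regret.

```python
import math


def ranking_regret_bound(surrogate_regret, p, lam):
    """Right-hand side of Theorem 13:
    sqrt(2) / (p * (1 - p) * sqrt(lam)) * sqrt(surrogate_regret).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if lam <= 0.0 or surrogate_regret < 0.0:
        raise ValueError("lam must be positive and the regret non-negative")
    return math.sqrt(2.0 * surrogate_regret / lam) / (p * (1.0 - p))


# Example: surrogate regret 0.01, balanced classes, lam = 4.
bound = ranking_regret_bound(0.01, 0.5, 4.0)
```

Since the ranking regret never exceeds one, the bound is informative only when the returned value is below one; for heavily imbalanced classes (small $p(1-p)$) this requires a correspondingly small surrogate regret.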



5.2 Examples 

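Example 1 (Exponential loss). The exponential loss $\ell_{\exp} : \{\pm 1\} \times \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$ defined as

$$\ell_{\exp}(y, \hat{y}) = e^{-y \hat{y}}$$

is a proper composite loss with associated proper loss $c_{\exp} : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ and link function
$\psi_{\exp} : [0, 1] \to \bar{\mathbb{R}}$ given by

$$c_{\exp}(y, \hat{\eta}) = \Big( \frac{1 - \hat{\eta}}{\hat{\eta}} \Big)^{y/2}; \qquad \psi_{\exp}(\hat{\eta}) = \frac{1}{2} \ln\Big( \frac{\hat{\eta}}{1 - \hat{\eta}} \Big).$$

It is easily verified that $c_{\exp}$ is regular. Moreover, it can be seen that

$$H_{\exp}(\eta) = 2 \sqrt{\eta (1 - \eta)},$$

with

$$-H''_{\exp}(\eta) = \frac{1}{2 (\eta (1 - \eta))^{3/2}} \geq 4 \quad \forall \eta \in [0, 1].$$

Thus $H_{\exp}$ is 4-strongly concave, and so by Theorem 10, we have $\ell_{\exp}$ is 4-strongly proper composite.
Therefore applying Theorem 13 we have for any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq \frac{1}{\sqrt{2}\, p(1-p)} \sqrt{\mathrm{regret}^{\exp}_D[f]}.$$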

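Example 2 (Logistic loss). The logistic loss $\ell_{\log} : \{\pm 1\} \times \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$ defined as

$$\ell_{\log}(y, \hat{y}) = \ln(1 + e^{-y \hat{y}})$$

is a proper composite loss with associated proper loss $c_{\log} : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ and link function
$\psi_{\log} : [0, 1] \to \bar{\mathbb{R}}$ given by

$$c_{\log}(1, \hat{\eta}) = -\ln \hat{\eta}; \qquad c_{\log}(-1, \hat{\eta}) = -\ln(1 - \hat{\eta}); \qquad \psi_{\log}(\hat{\eta}) = \ln\Big( \frac{\hat{\eta}}{1 - \hat{\eta}} \Big).$$

Again, it is easily verified that $c_{\log}$ is regular. Moreover, it can be seen that

$$H_{\log}(\eta) = -\eta \ln \eta - (1 - \eta) \ln(1 - \eta),$$

with

$$-H''_{\log}(\eta) = \frac{1}{\eta (1 - \eta)} \geq 4 \quad \forall \eta \in [0, 1].$$

Thus $H_{\log}$ is 4-strongly concave, and so by Theorem 10, we have $\ell_{\log}$ is 4-strongly proper composite.
Therefore applying Theorem 13 we have for any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq \frac{1}{\sqrt{2}\, p(1-p)} \sqrt{\mathrm{regret}^{\log}_D[f]}.$$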
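Example 3 (Squared and squared hinge losses). The (binary) squared loss $(1 - y \hat{y})^2$ and squared
hinge loss $((1 - y \hat{y})_+)^2$ (where $u_+ = \max(u, 0)$) are generally defined for $\hat{y} \in \mathbb{R}$. To obtain class
probability estimates from a predicted value $\hat{y} \in \mathbb{R}$, one then truncates $\hat{y}$ to $[-1, 1]$, and uses
$\hat{\eta} = \frac{\hat{y} + 1}{2}$. To obtain a proper loss, we can take $\widehat{\mathcal{Y}} = [-1, 1]$; in this range, both losses coincide, and we
can define $\ell_{\mathrm{sq}} : \{\pm 1\} \times [-1, 1] \to [0, 4]$ as

$$\ell_{\mathrm{sq}}(y, \hat{y}) = (1 - y \hat{y})^2.$$

This is a proper composite loss with associated proper loss $c_{\mathrm{sq}} : \{\pm 1\} \times [0, 1] \to [0, 4]$ and link
function $\psi_{\mathrm{sq}} : [0, 1] \to [-1, 1]$ given by

$$c_{\mathrm{sq}}(1, \hat{\eta}) = 4 (1 - \hat{\eta})^2; \qquad c_{\mathrm{sq}}(-1, \hat{\eta}) = 4 \hat{\eta}^2; \qquad \psi_{\mathrm{sq}}(\hat{\eta}) = 2 \hat{\eta} - 1.$$

It can be seen that

$$L_{\mathrm{sq}}(\eta, \hat{\eta}) = 4 \eta (1 - \hat{\eta})^2 + 4 (1 - \eta) \hat{\eta}^2$$

and

$$H_{\mathrm{sq}}(\eta) = 4 \eta (1 - \eta),$$

so that

$$L_{\mathrm{sq}}(\eta, \hat{\eta}) - H_{\mathrm{sq}}(\eta) = 4 (\eta - \hat{\eta})^2.$$

Thus $\ell_{\mathrm{sq}}$ is 8-strongly proper composite, and so applying Theorem 13 we have for any $f : \mathcal{X} \to [-1, 1]$,

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq \frac{1}{2 p(1-p)} \sqrt{\mathrm{regret}^{\mathrm{sq}}_D[f]}.$$

Note that, if a function $f : \mathcal{X} \to \mathbb{R}$ is learned, then our bound in terms of $\ell_{\mathrm{sq}}$-regret applies to the
ranking regret of an appropriately transformed function $\bar{f} : \mathcal{X} \to [-1, 1]$, such as that obtained by
truncating values $f(x) \notin [-1, 1]$ to the appropriate endpoint $-1$ or $1$:

$$\bar{f}(x) = \begin{cases} -1 & \text{if } f(x) < -1 \\ f(x) & \text{if } f(x) \in [-1, 1] \\ 1 & \text{if } f(x) > 1. \end{cases}$$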
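The truncation described in Example 3, together with the mapping from a truncated score to a class probability estimate via the inverse of the link $2\hat{\eta} - 1$, can be sketched as below. The function names are illustrative only.

```python
def truncate_score(score):
    """Clip a real-valued score to [-1, 1], mapping values outside to the nearest endpoint."""
    return max(-1.0, min(1.0, score))


def score_to_probability(score):
    """Class probability estimate (truncated score + 1) / 2, the inverse of the link 2 * eta - 1."""
    return (truncate_score(score) + 1.0) / 2.0


# Scores inside [-1, 1] pass through unchanged; the others are clipped.
clipped = [truncate_score(s) for s in (-3.0, 0.25, 2.5)]
```

Truncation can merge distinct scores that lie beyond the same endpoint into ties, so the ranking induced by the truncated function may differ from that of the original scores outside $[-1, 1]$.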

5.3 Constructing Strongly Proper Losses 

In general, given any concave function $H : [0, 1] \to \mathbb{R}_+$, one can construct a proper loss
$c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ with $H_c = H$ as follows:

$$c(1, \hat{\eta}) = H(\hat{\eta}) + (1 - \hat{\eta})\, H'(\hat{\eta}) \tag{31}$$

$$c(-1, \hat{\eta}) = H(\hat{\eta}) - \hat{\eta}\, H'(\hat{\eta}), \tag{32}$$

where $H'(\hat{\eta})$ denotes any superderivative of $H$ at $\hat{\eta}$. It can be verified that this gives
$L_c(\eta, \hat{\eta}) = H(\hat{\eta}) + (\eta - \hat{\eta}) H'(\hat{\eta})$ for all $\eta, \hat{\eta} \in [0, 1]$, and therefore $H_c(\eta) = H(\eta)$ for all $\eta \in [0, 1]$. Moreover,
if $H$ is such that $H(\hat{\eta}) + (1 - \hat{\eta}) H'(\hat{\eta}) \in \mathbb{R}_+ \;\forall \hat{\eta} \in (0, 1]$ and $H(\hat{\eta}) - \hat{\eta} H'(\hat{\eta}) \in \mathbb{R}_+ \;\forall \hat{\eta} \in [0, 1)$, then
the loss $c$ constructed above is also regular. Thus, starting with any $\lambda$-strongly concave function
$H : [0, 1] \to \mathbb{R}_+$ satisfying these regularity conditions, any proper composite loss $\ell$ formed from the
loss function $c$ constructed according to Eqs. (31-32) (and any link function $\psi$) is $\lambda$-strongly proper
composite.
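The construction in Eqs. (31)-(32) is mechanical enough to write down directly. The sketch below builds a CPE loss from a function `H` and its derivative `dH` (both supplied by the caller; the factory name is an assumption of this example) and applies it to $H(\eta) = 4\eta(1-\eta)$, which should recover the proper loss $c_{\mathrm{sq}}$ of Example 3.

```python
def loss_from_bayes_risk(H, dH):
    """Build c(1, .) and c(-1, .) from H and a (super)derivative dH via Eqs. (31)-(32)."""

    def c(y, eta_hat):
        if y == 1:
            return H(eta_hat) + (1.0 - eta_hat) * dH(eta_hat)   # Eq. (31)
        return H(eta_hat) - eta_hat * dH(eta_hat)               # Eq. (32)

    return c


def cond_risk(c, eta, eta_hat):
    """L_c(eta, eta_hat) = eta * c(1, eta_hat) + (1 - eta) * c(-1, eta_hat)."""
    return eta * c(1, eta_hat) + (1.0 - eta) * c(-1, eta_hat)


# H(eta) = 4 * eta * (1 - eta), with derivative 4 - 8 * eta.
c_sq = loss_from_bayes_risk(lambda e: 4.0 * e * (1.0 - e), lambda e: 4.0 - 8.0 * e)
```

Evaluating `cond_risk(c_sq, eta, eta)` returns $H(\eta)$ again, which is the identity $H_c = H$ stated above.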
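Table 1: Examples of strongly proper composite losses $\ell : \{\pm 1\} \times \widehat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ satisfying the conditions
of Theorem 13, together with prediction space $\widehat{\mathcal{Y}}$, proper loss $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$, link function
$\psi : [0, 1] \to \widehat{\mathcal{Y}}$, and strong properness parameter $\lambda$.

Loss | $\widehat{\mathcal{Y}}$ | $\ell(y, \hat{y})$ | $c(1, \hat{\eta})$ | $c(-1, \hat{\eta})$ | $\psi(\hat{\eta})$ | $\lambda$
Exponential | $\bar{\mathbb{R}}$ | $e^{-y\hat{y}}$ | $\sqrt{\frac{1-\hat{\eta}}{\hat{\eta}}}$ | $\sqrt{\frac{\hat{\eta}}{1-\hat{\eta}}}$ | $\frac{1}{2}\ln\big(\frac{\hat{\eta}}{1-\hat{\eta}}\big)$ | 4
Logistic | $\bar{\mathbb{R}}$ | $\ln(1 + e^{-y\hat{y}})$ | $-\ln \hat{\eta}$ | $-\ln(1 - \hat{\eta})$ | $\ln\big(\frac{\hat{\eta}}{1-\hat{\eta}}\big)$ | 4
Squared | $[-1, 1]$ | $(1 - y\hat{y})^2$ | $4(1 - \hat{\eta})^2$ | $4\hat{\eta}^2$ | $2\hat{\eta} - 1$ | 8
Spherical | $[0, 1]$ | $c(y, \hat{y})$ | $1 - \frac{\hat{\eta}}{\sqrt{\hat{\eta}^2 + (1-\hat{\eta})^2}}$ | $1 - \frac{1-\hat{\eta}}{\sqrt{\hat{\eta}^2 + (1-\hat{\eta})^2}}$ | $\hat{\eta}$ | 1
Canonical 'exponential' | $\bar{\mathbb{R}}$ | $\sqrt{1 + (\frac{\hat{y}}{2})^2} - \frac{y\hat{y}}{2}$ | $\sqrt{\frac{1-\hat{\eta}}{\hat{\eta}}}$ | $\sqrt{\frac{\hat{\eta}}{1-\hat{\eta}}}$ | $\frac{2\hat{\eta} - 1}{\sqrt{\hat{\eta}(1-\hat{\eta})}}$ | 4
Canonical squared | $[-1, 1]$ | $\frac{1}{4}(1 - y\hat{y})^2$ | $(1 - \hat{\eta})^2$ | $\hat{\eta}^2$ | $2\hat{\eta} - 1$ | 2
Canonical spherical | $[-1, 1]$ | $1 - \frac{1}{2}\big(\sqrt{2 - \hat{y}^2} + y\hat{y}\big)$ | $1 - \frac{\hat{\eta}}{\sqrt{\hat{\eta}^2 + (1-\hat{\eta})^2}}$ | $1 - \frac{1-\hat{\eta}}{\sqrt{\hat{\eta}^2 + (1-\hat{\eta})^2}}$ | $\frac{2\hat{\eta} - 1}{\sqrt{\hat{\eta}^2 + (1-\hat{\eta})^2}}$ | 1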



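Example 4 (Spherical loss). Consider starting with the function $H_{\mathrm{spher}} : [0, 1] \to \mathbb{R}$ defined as

$$H_{\mathrm{spher}}(\eta) = 1 - \sqrt{\eta^2 + (1 - \eta)^2}.$$

Then

$$H'_{\mathrm{spher}}(\eta) = \frac{-(2\eta - 1)}{\sqrt{\eta^2 + (1 - \eta)^2}}$$

and

$$-H''_{\mathrm{spher}}(\eta) = \frac{1}{(\eta^2 + (1 - \eta)^2)^{3/2}} \geq 1 \quad \forall \eta \in [0, 1],$$

and therefore $H_{\mathrm{spher}}$ is 1-strongly concave. Moreover, since $H_{\mathrm{spher}}$ and $H'_{\mathrm{spher}}$ are both bounded, the
conditions for regularity are also satisfied. Thus we can use Eqs. (31-32) to construct a 1-strongly
proper loss $c_{\mathrm{spher}} : \{\pm 1\} \times [0, 1] \to \mathbb{R}$ as follows:

$$c_{\mathrm{spher}}(1, \hat{\eta}) = H_{\mathrm{spher}}(\hat{\eta}) + (1 - \hat{\eta})\, H'_{\mathrm{spher}}(\hat{\eta}) = 1 - \frac{\hat{\eta}}{\sqrt{\hat{\eta}^2 + (1 - \hat{\eta})^2}}$$

$$c_{\mathrm{spher}}(-1, \hat{\eta}) = H_{\mathrm{spher}}(\hat{\eta}) - \hat{\eta}\, H'_{\mathrm{spher}}(\hat{\eta}) = 1 - \frac{1 - \hat{\eta}}{\sqrt{\hat{\eta}^2 + (1 - \hat{\eta})^2}}.$$

Therefore by Theorem 13, we have for any $f : \mathcal{X} \to [0, 1]$,

$$\mathrm{regret}^{\mathrm{rank}}_D[f] \leq \frac{\sqrt{2}}{p(1-p)} \sqrt{\mathrm{regret}^{\mathrm{spher}}_D[f]}.$$

The loss $c_{\mathrm{spher}}$ above corresponds to the spherical scoring rule described in [15].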
We also note that, for every strictly proper loss $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$, there is an associated
'canonical' link function $\psi : [0, 1] \to \widehat{\mathcal{Y}}$ defined as

$$\psi(\hat{\eta}) = c(-1, \hat{\eta}) - c(1, \hat{\eta}), \tag{33}$$

where $\widehat{\mathcal{Y}} = \{\psi(\hat{\eta}) : \hat{\eta} \in [0, 1]\}$. We refer to composite losses comprised of such a strictly proper loss $c$
with the corresponding canonical link $\psi$ as canonical proper composite losses. Clearly, multiplying
$c$ by a factor $\alpha > 0$ results in the corresponding canonical link $\psi$ also being multiplied by $\alpha$; adding
a constant (or a function $\theta(y, \hat{\eta}) = \theta(\hat{\eta})$ that is independent of $y$) to $c$ has no effect on $\psi$. Conversely,
given any $\widehat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$ and any (strictly increasing) link function $\psi : [0, 1] \to \widehat{\mathcal{Y}}$, there is a unique strictly
proper loss $c : \{\pm 1\} \times [0, 1] \to \bar{\mathbb{R}}_+$ (up to addition of constants or functions of the form $\theta(y, \hat{\eta}) = \theta(\hat{\eta})$)
for which $\psi$ is canonical; this is obtained using Eqs. (31-32) with $H$ satisfying $H'(\hat{\eta}) = -\psi(\hat{\eta})$ (with
possible addition of a term $\theta(\hat{\eta})$ to both $c(1, \hat{\eta})$ and $c(-1, \hat{\eta})$ thus constructed). Canonical proper
composite losses $\ell(y, \hat{y})$ have some desirable properties, including for example convexity in their
second argument $\hat{y}$ for each $y \in \{\pm 1\}$; we refer the reader to [6, 22] for further discussion of such
properties.

We note that the logistic loss in Example 2 is a canonical proper composite loss. On the
other hand, as noted in [6], the link $\psi_{\exp}$ associated with the exponential loss in Example 1 is
not the canonical link for the proper loss $c_{\exp}$ (see Example 5). The squared loss in Example 3 is
almost canonical, modulo a scaling factor; one needs to scale either the link function or the loss
appropriately (Example 6).

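Example 5 (Canonical proper composite loss associated with $c_{\mathrm{exp}}$). Let $c_{\mathrm{exp}} : \{\pm 1\} \times [0,1] \to \bar{\mathbb{R}}_+$ be as in Example 1. The corresponding canonical link $\psi_{\mathrm{exp,can}} : [0,1] \to \bar{\mathbb{R}}$ is given by

$$\psi_{\mathrm{exp,can}}(\hat\eta) = \sqrt{\frac{\hat\eta}{1-\hat\eta}} - \sqrt{\frac{1-\hat\eta}{\hat\eta}} = \frac{2\hat\eta - 1}{\sqrt{\hat\eta(1-\hat\eta)}}\,.$$

With a little algebra, it can be seen that the resulting canonical proper composite loss $\ell_{\mathrm{exp,can}} : \{\pm 1\} \times \bar{\mathbb{R}} \to \bar{\mathbb{R}}_+$ is given by

$$\ell_{\mathrm{exp,can}}(y, \hat y) = \sqrt{1 + \left(\frac{\hat y}{2}\right)^2} - \frac{y \hat y}{2}\,.$$

Since we saw $c_{\mathrm{exp}}$ is 4-strongly proper, we have $\ell_{\mathrm{exp,can}}$ is 4-strongly proper composite, and therefore we have from Theorem 13 that for any $f : \mathcal{X} \to \bar{\mathbb{R}}$,

$$\mathrm{regret}_D^{\mathrm{rank}}[f] \le \frac{1}{\sqrt{2}\, p(1-p)} \sqrt{\mathrm{regret}_D^{\mathrm{exp,can}}[f]}\,.$$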
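The "little algebra" in Example 5 can be sanity-checked numerically: composing the closed-form canonical loss with the canonical link should recover the underlying proper loss. The sketch below assumes that $c_{\mathrm{exp}}(1,\hat\eta) = \sqrt{(1-\hat\eta)/\hat\eta}$ and $c_{\mathrm{exp}}(-1,\hat\eta) = \sqrt{\hat\eta/(1-\hat\eta)}$ (the form consistent with the canonical link above); the function names are illustrative.

```python
import math

def c_exp(y, eta_hat):
    """Assumed proper loss underlying the exponential loss (Example 1)."""
    ratio = (1.0 - eta_hat) / eta_hat
    return math.sqrt(ratio) if y == 1 else math.sqrt(1.0 / ratio)

def psi_exp_can(eta_hat):
    """Canonical link: c_exp(-1, eta_hat) - c_exp(1, eta_hat)."""
    return (2.0 * eta_hat - 1.0) / math.sqrt(eta_hat * (1.0 - eta_hat))

def ell_exp_can(y, y_hat):
    """Closed form of the canonical proper composite loss."""
    return math.sqrt(1.0 + (y_hat / 2.0) ** 2) - y * y_hat / 2.0

# Interior grid only: the link diverges at eta_hat = 0 and eta_hat = 1.
etas = [i / 100 for i in range(1, 100)]
max_gap = max(
    abs(ell_exp_can(y, psi_exp_can(e)) - c_exp(y, e))
    for y in (1, -1)
    for e in etas
)
print(max_gap)
```

Agreement up to rounding error on the grid indicates that $\ell_{\mathrm{exp,can}}(y, \psi_{\mathrm{exp,can}}(\hat\eta)) = c_{\mathrm{exp}}(y, \hat\eta)$, which is the defining property of the composite loss.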

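Example 6 (Canonical squared loss). For $c_{\mathrm{sq}} : \{\pm 1\} \times [0,1] \to [0,4]$ defined as in Example 3, the canonical link $\psi_{\mathrm{sq,can}} : [0,1] \to \hat{\mathcal{Y}}$ is given by

$$\psi_{\mathrm{sq,can}}(\hat\eta) = 4\hat\eta^2 - 4(1-\hat\eta)^2 = 4(2\hat\eta - 1)\,,$$

with $\hat{\mathcal{Y}} = [-4,4]$, and the resulting canonical squared loss $\ell_{\mathrm{sq,can}} : \{\pm 1\} \times [-4,4] \to [0,4]$ is given by

$$\ell_{\mathrm{sq,can}}(y, \hat y) = \left(1 - \frac{y \hat y}{4}\right)^2\,.$$

Since we saw $c_{\mathrm{sq}}$ is 4-strongly proper, we have $\ell_{\mathrm{sq,can}}$ is 4-strongly proper composite, giving for any $f : \mathcal{X} \to [-4,4]$,

$$\mathrm{regret}_D^{\mathrm{rank}}[f] \le \frac{1}{\sqrt{2}\, p(1-p)} \sqrt{\mathrm{regret}_D^{\mathrm{sq,can}}[f]}\,.$$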

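For practical purposes, this is equivalent to using the loss $\ell_{\mathrm{sq}} : \{\pm 1\} \times [-1,1] \to [0,4]$ defined in Example 3. Alternatively, we can start with a scaled version of the squared proper loss $c_{\mathrm{sq}'} : \{\pm 1\} \times [0,1] \to [0,1]$ defined as

$$c_{\mathrm{sq}'}(1, \hat\eta) = (1-\hat\eta)^2\,; \qquad c_{\mathrm{sq}'}(-1, \hat\eta) = \hat\eta^2\,,$$

for which the associated canonical link $\psi_{\mathrm{sq}',\mathrm{can}} : [0,1] \to \hat{\mathcal{Y}}$ is given by

$$\psi_{\mathrm{sq}',\mathrm{can}}(\hat\eta) = \hat\eta^2 - (1-\hat\eta)^2 = 2\hat\eta - 1\,,$$

with $\hat{\mathcal{Y}} = [-1,1]$, and the resulting canonical squared loss $\ell_{\mathrm{sq}',\mathrm{can}} : \{\pm 1\} \times [-1,1] \to [0,1]$ is given by

$$\ell_{\mathrm{sq}',\mathrm{can}}(y, \hat y) = \frac{(1 - y \hat y)^2}{4}\,.$$

Again, it can be verified that $c_{\mathrm{sq}'}$ is regular; in this case $H_{\mathrm{sq}'}(\eta) = \eta(1-\eta)$, which is 2-strongly concave, giving that $\ell_{\mathrm{sq}',\mathrm{can}}$ is 2-strongly proper composite. Therefore applying Theorem 13 we have for any $f : \mathcal{X} \to [-1,1]$,

$$\mathrm{regret}_D^{\mathrm{rank}}[f] \le \frac{1}{p(1-p)} \sqrt{\mathrm{regret}_D^{\mathrm{sq}',\mathrm{can}}[f]}\,.$$

Again, for practical purposes, this is equivalent to using the loss $\ell_{\mathrm{sq}}$ defined in Example 3.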
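For the scaled squared loss $c_{\mathrm{sq}'}$, both the composite form and the strong-properness constant can be checked directly. The sketch below assumes the conditional regret is $L_c(\eta,\hat\eta) - H_c(\eta)$ with $L_c(\eta,\hat\eta) = \eta\, c(1,\hat\eta) + (1-\eta)\, c(-1,\hat\eta)$; the helper names are illustrative.

```python
def c_sq_scaled(y, eta_hat):
    """Scaled squared proper loss c_sq'."""
    return (1.0 - eta_hat) ** 2 if y == 1 else eta_hat ** 2

def psi_sq_scaled_can(eta_hat):
    """Canonical link of c_sq': c(-1, .) - c(1, .) = 2 * eta_hat - 1."""
    return c_sq_scaled(-1, eta_hat) - c_sq_scaled(1, eta_hat)

def ell_sq_scaled_can(y, y_hat):
    """Closed form of the canonical composite loss on [-1, 1]."""
    return (1.0 - y * y_hat) ** 2 / 4.0

def cond_regret(eta, eta_hat):
    """Conditional regret L(eta, eta_hat) - H(eta), with H(eta) = eta * (1 - eta)."""
    risk = eta * c_sq_scaled(1, eta_hat) + (1.0 - eta) * c_sq_scaled(-1, eta_hat)
    return risk - eta * (1.0 - eta)

grid = [i / 40 for i in range(41)]
# Composite form: ell(y, psi(eta_hat)) should equal c(y, eta_hat).
composite_gap = max(
    abs(ell_sq_scaled_can(y, psi_sq_scaled_can(e)) - c_sq_scaled(y, e))
    for y in (1, -1)
    for e in grid
)
# Conditional regret versus (lambda / 2) * (eta_hat - eta)^2 with lambda = 2.
regret_gap = max(abs(cond_regret(e, eh) - (eh - e) ** 2) for e in grid for eh in grid)
print(composite_gap, regret_gap)
```

Here the conditional regret equals $(\hat\eta - \eta)^2$ exactly, that is, $\frac{\lambda}{2}(\hat\eta - \eta)^2$ with $\lambda = 2$, so the strong-properness inequality holds with equality for this loss.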

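Example 7 (Canonical spherical loss). For $c_{\mathrm{spher}} : \{\pm 1\} \times [0,1] \to \mathbb{R}$ defined as in Example 4, the canonical link $\psi_{\mathrm{spher,can}} : [0,1] \to \hat{\mathcal{Y}}$ is given by

$$\psi_{\mathrm{spher,can}}(\hat\eta) = \frac{2\hat\eta - 1}{\sqrt{\hat\eta^2 + (1-\hat\eta)^2}}\,,$$

with $\hat{\mathcal{Y}} = [-1,1]$. The resulting canonical spherical loss $\ell_{\mathrm{spher,can}} : \{\pm 1\} \times [-1,1] \to \mathbb{R}$ is given by

$$\ell_{\mathrm{spher,can}}(y, \hat y) = 1 - \frac{1}{2}\left(\sqrt{2 - \hat y^2} + y \hat y\right)\,.$$

Since we saw $c_{\mathrm{spher}}$ is 1-strongly proper, we have $\ell_{\mathrm{spher,can}}$ is 1-strongly proper composite, and therefore we have from Theorem 13 that for any $f : \mathcal{X} \to [-1,1]$,

$$\mathrm{regret}_D^{\mathrm{rank}}[f] \le \frac{\sqrt{2}}{p(1-p)} \sqrt{\mathrm{regret}_D^{\mathrm{spher,can}}[f]}\,.$$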
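As with the exponential case, the closed form of the canonical spherical loss can be checked by composing it with the canonical link and comparing against $c_{\mathrm{spher}}$. This is a minimal sketch with illustrative function names, using the closed forms of Example 4.

```python
import math

def c_spher(y, eta_hat):
    """Spherical proper loss (closed form from Example 4)."""
    norm = math.sqrt(eta_hat ** 2 + (1.0 - eta_hat) ** 2)
    return 1.0 - (eta_hat if y == 1 else 1.0 - eta_hat) / norm

def psi_spher_can(eta_hat):
    """Canonical link: c_spher(-1, eta_hat) - c_spher(1, eta_hat)."""
    return (2.0 * eta_hat - 1.0) / math.sqrt(eta_hat ** 2 + (1.0 - eta_hat) ** 2)

def ell_spher_can(y, y_hat):
    """Closed form of the canonical spherical composite loss on [-1, 1]."""
    return 1.0 - 0.5 * (math.sqrt(2.0 - y_hat ** 2) + y * y_hat)

grid = [i / 100 for i in range(101)]
# The link should map [0, 1] into [-1, 1].
link_range = (min(psi_spher_can(e) for e in grid), max(psi_spher_can(e) for e in grid))
# Composite form: ell(y, psi(eta_hat)) should equal c(y, eta_hat).
max_gap = max(
    abs(ell_spher_can(y, psi_spher_can(e)) - c_spher(y, e))
    for y in (1, -1)
    for e in grid
)
print(link_range, max_gap)
```

The link takes the values $-1$ and $1$ at the endpoints $\hat\eta = 0$ and $\hat\eta = 1$, matching $\hat{\mathcal{Y}} = [-1,1]$.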
6 Tighter Bounds under Low-Noise Conditions



In essence, our results exploit the fact that for a $\lambda$-strongly proper composite loss $\ell$ formed from a $\lambda$-strongly proper loss $c$ and link function $\psi$, given any scoring function $f$, the $L_2(\mu)$ distance (where $\mu$ denotes the marginal density of $D$ on $\mathcal{X}$) between $\psi^{-1}(f(X))$ and $\eta(X)$ (and therefore the $L_1(\mu)$ distance between $\psi^{-1}(f(X))$ and $\eta(X)$, which gives an upper bound on the ranking regret of $f$) can be upper bounded precisely in terms of the $\ell$-regret of $f$. From this perspective, $\hat\eta = \psi^{-1} \circ f$ can be treated as a 'plug-in' scoring function, which we analyzed via Corollary 12.
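The two distance comparisons in this argument can be written out explicitly. The following is a sketch, assuming that $\lambda$-strong properness of $c$ means $L_c(\eta,\hat\eta) - H_c(\eta) \ge \frac{\lambda}{2}(\hat\eta - \eta)^2$ pointwise, and that the $\ell$-regret is the expectation of this conditional regret evaluated at $\hat\eta = \psi^{-1}(f(X))$:

```latex
\begin{align*}
\mathbf{E}_X\bigl[\,|\psi^{-1}(f(X)) - \eta(X)|\,\bigr]
  &\le \sqrt{\mathbf{E}_X\bigl[(\psi^{-1}(f(X)) - \eta(X))^2\bigr]}
  && \text{(Jensen: $L_1(\mu)$ is dominated by $L_2(\mu)$)} \\
  &\le \sqrt{\frac{2}{\lambda}\,\mathrm{regret}_D^{\ell}[f]}
  && \text{($c$ is $\lambda$-strongly proper).}
\end{align*}
```

The first step is where the square root on the surrogate regret originates; the low-noise analysis below replaces this step with a sharper comparison.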

Recently, [9] showed that, under certain low-noise assumptions, one can obtain tighter bounds on the ranking regret of a plug-in scoring function $\hat\eta : \mathcal{X} \to [0,1]$ than that offered by Corollary 12. Specifically, [9] consider the following noise assumption for bipartite ranking (inspired by the noise condition studied in [28] for binary classification):

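Noise Assumption NA($\alpha$) ($\alpha \in [0,1]$): A distribution $D$ on $\mathcal{X} \times \{\pm 1\}$ satisfies assumption NA($\alpha$) if $\exists$ a constant $C > 0$ such that for all $x \in \mathcal{X}$ and $t \in [0,1]$,

$$\mathbf{P}_X\bigl(|\eta(X) - \eta(x)| \le t\bigr) \le C \cdot t^{\alpha}\,.$$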
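To make the assumption concrete, consider a purely hypothetical distribution (not from the paper) under which $\eta(X)$ is uniformly distributed on $[0,1]$. Then $\mathbf{P}_X(|\eta(X) - \eta(x)| \le t)$ is the length of $[\eta(x) - t, \eta(x) + t] \cap [0,1]$, which can be computed exactly. The sketch below checks NA($\alpha$) on a grid for a given pair $(\alpha, C)$; the function names are illustrative.

```python
def prob_within(eta_x, t):
    """P(|eta(X) - eta_x| <= t) when eta(X) is uniform on [0, 1] (assumed)."""
    return min(1.0, eta_x + t) - max(0.0, eta_x - t)

def satisfies_na(alpha, C, grid_size=200):
    """Grid check of NA(alpha) with constant C for the uniform example."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return all(
        prob_within(eta_x, t) <= C * t ** alpha + 1e-12
        for eta_x in grid
        for t in grid
    )

print(satisfies_na(alpha=1.0, C=2.0))  # interval length is at most 2t
print(satisfies_na(alpha=1.0, C=1.0))  # constant too small for interior points
print(satisfies_na(alpha=0.0, C=1.0))  # a probability never exceeds 1
```

In this example the interval has length at most $2t$, so the condition holds with $C = 2$ for every $\alpha \in [0,1]$ (since $t \le t^{\alpha}$ on $[0,1]$), while $C = 1$ fails for $\alpha = 1$ at interior points.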

Note that $\alpha = 0$ imposes no restriction on $D$, while larger values of $\alpha$ impose greater restrictions. [9] showed the following result (adapted slightly to our setting, where the ranking risk is conditioned on $Y \ne Y'$):

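Theorem 14 ([9]). Let $\alpha \in [0,1)$ and $q \in [1,\infty)$. Then $\exists$ a constant $C_{\alpha,q} > 0$ such that for any distribution $D$ on $\mathcal{X} \times \{\pm 1\}$ satisfying noise assumption NA($\alpha$) and any $\hat\eta : \mathcal{X} \to [0,1]$,

$$\mathrm{regret}_D^{\mathrm{rank}}[\hat\eta] \le \frac{C_{\alpha,q}}{p(1-p)} \Bigl(\mathbf{E}_X\bigl[|\hat\eta(X) - \eta(X)|^q\bigr]\Bigr)^{\frac{1+\alpha}{q+\alpha}}\,.$$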

This allows us to obtain the following tighter version of our regret bound in terms of strongly 
proper losses under the same noise assumption: 

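Theorem 15. Let $\hat{\mathcal{Y}} \subseteq \bar{\mathbb{R}}$ and $\lambda > 0$, and let $\alpha \in [0,1)$. Let $\ell : \{\pm 1\} \times \hat{\mathcal{Y}} \to \bar{\mathbb{R}}_+$ be a $\lambda$-strongly proper composite loss. Then $\exists$ a constant $C_\alpha > 0$ such that for any distribution $D$ on $\mathcal{X} \times \{\pm 1\}$ satisfying noise assumption NA($\alpha$) and any $f : \mathcal{X} \to \hat{\mathcal{Y}}$,

$$\mathrm{regret}_D^{\mathrm{rank}}[f] \le \frac{C_\alpha}{p(1-p)} \left(\frac{2}{\lambda}\right)^{\frac{1+\alpha}{2+\alpha}} \bigl(\mathrm{regret}_D^{\ell}[f]\bigr)^{\frac{1+\alpha}{2+\alpha}}\,.$$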

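Proof. Let $c : \{\pm 1\} \times [0,1] \to \bar{\mathbb{R}}_+$ be a $\lambda$-strongly proper loss and $\psi : [0,1] \to \hat{\mathcal{Y}}$ be a (strictly increasing) link function such that $\ell(y, \hat y) = c(y, \psi^{-1}(\hat y))$ for all $y \in \{\pm 1\}$, $\hat y \in \hat{\mathcal{Y}}$. Let $D$ be a distribution on $\mathcal{X} \times \{\pm 1\}$ satisfying noise assumption NA($\alpha$) and let $f : \mathcal{X} \to \hat{\mathcal{Y}}$. Then we have,

$$\begin{aligned}
\mathrm{regret}_D^{\mathrm{rank}}[f]
 &= \mathrm{regret}_D^{\mathrm{rank}}[\psi^{-1} \circ f]\,, && \text{since } \psi \text{ is strictly increasing} \\
 &\le \frac{C_{\alpha,2}}{p(1-p)} \Bigl(\mathbf{E}_X\bigl[(\psi^{-1}(f(X)) - \eta(X))^2\bigr]\Bigr)^{\frac{1+\alpha}{2+\alpha}}\,, && \text{by Theorem 14, taking } q = 2 \\
 &\le \frac{C_{\alpha,2}}{p(1-p)} \Bigl(\frac{2}{\lambda}\, \mathbf{E}_X\bigl[R_\ell(\eta(X), f(X))\bigr]\Bigr)^{\frac{1+\alpha}{2+\alpha}}\,, && \text{since } c \text{ is } \lambda\text{-strongly proper} \\
 &= \frac{C_{\alpha,2}}{p(1-p)} \left(\frac{2}{\lambda}\right)^{\frac{1+\alpha}{2+\alpha}} \bigl(\mathrm{regret}_D^{\ell}[f]\bigr)^{\frac{1+\alpha}{2+\alpha}}\,.
\end{aligned}$$

The result follows by setting $C_\alpha = C_{\alpha,2}$. □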



For $\alpha = 0$, as noted above, there is no restriction on $D$, and so the above result gives the same dependence on $\mathrm{regret}_D^{\ell}[f]$ as that obtained from Theorem 13. On the other hand, as $\alpha$ approaches 1, the exponent of the $\mathrm{regret}_D^{\ell}[f]$ term in the above bound approaches $\frac{2}{3}$, which improves over the exponent of $\frac{1}{2}$ in Theorem 13.
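The dependence of the exponent on $\alpha$ is easy to tabulate exactly. The sketch below evaluates $(1+\alpha)/(2+\alpha)$ with rational arithmetic for a few illustrative values of $\alpha$; the function name is hypothetical, and the unknown constant $C_\alpha$ is deliberately left out.

```python
from fractions import Fraction

def regret_exponent(alpha):
    """Exponent (1 + alpha) / (2 + alpha) on the surrogate regret in Theorem 15."""
    alpha = Fraction(alpha)
    if not 0 <= alpha < 1:
        raise ValueError("alpha must lie in [0, 1)")
    return (1 + alpha) / (2 + alpha)

# Exponents for a few noise levels: 1/2 at alpha = 0, increasing towards 2/3.
for a in ("0", "1/4", "1/2", "3/4", "99/100"):
    print(a, regret_exponent(a))
```

Because the surrogate regret is small in the regime of interest, a larger exponent gives a smaller bound: ignoring constants, a surrogate regret of $10^{-4}$ contributes $10^{-2}$ at exponent $\frac{1}{2}$ and $10^{-8/3} \approx 2.2 \times 10^{-3}$ at exponent $\frac{2}{3}$.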



7 Conclusion and Open Questions 

We have obtained upper bounds on the bipartite ranking regret of a scoring function in terms of 
the (non-pairwise) regret associated with a broad class of proper (composite) losses that we have 
termed strongly proper (composite) losses. This class includes several widely used losses such as 
exponential, logistic, squared and squared hinge losses as special cases. 

The definition and characterization of strongly proper losses may be of interest in its own right, 
and may find applications elsewhere. An open question concerns the necessity of the regularity 
condition in the characterization of strong properness of a proper loss in terms of strong concavity 
of the conditional Bayes risk (Theorem 10). The characterization of strict properness of a proper
loss in terms of strict concavity of the conditional Bayes risk (Theorem 4) does not require such an
assumption, and one wonders whether it may be possible to remove the regularity assumption in 
the case of strong properness as well. 

Many of the strongly proper composite losses that we have considered, such as the exponential, 
logistic, squared and spherical losses, are margin-based losses, which means the bipartite ranking 
regret can also be upper bounded in terms of the regret associated with pairwise versions of these 
losses via the reduction to pairwise classification (Section 3.1). A natural question that arises is
whether it is possible to characterize conditions on the distribution under which algorithms based 
on one of the two approaches (minimizing a pairwise form of the loss as in RankBoost/pairwise 
logistic regression, or minimizing the standard loss as in AdaBoost/standard logistic regression) 
lead to faster convergence than those based on the other. We hope the tools and results established 
here may help in studying such questions in the future. 
A Proof of Corollary 12



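Proof. Let $\hat\eta : \mathcal{X} \to [0,1]$. By Theorem 11, we have

$$\mathrm{regret}_D^{\mathrm{rank}}[\hat\eta] \le \frac{1}{2p(1-p)}\, \mathbf{E}_{X,X'}\Bigl[\,|\eta(X) - \eta(X')| \cdot \mathbf{1}\bigl((\hat\eta(X) - \hat\eta(X'))(\eta(X) - \eta(X')) \le 0\bigr)\Bigr]\,.$$

The result follows by observing that for any $x, x' \in \mathcal{X}$,

$$(\hat\eta(x) - \hat\eta(x'))(\eta(x) - \eta(x')) \le 0 \implies |\eta(x) - \eta(x')| \le |\hat\eta(x) - \eta(x)| + |\hat\eta(x') - \eta(x')|\,.$$

To see this, note that the statement is trivially true if $\eta(x) = \eta(x')$. If $\eta(x) > \eta(x')$, then we have

$$\begin{aligned}
(\hat\eta(x) - \hat\eta(x'))(\eta(x) - \eta(x')) \le 0
 &\implies \hat\eta(x) \le \hat\eta(x') \\
 &\implies \eta(x) - \eta(x') \le (\eta(x) - \hat\eta(x)) + (\hat\eta(x') - \eta(x')) \\
 &\implies \eta(x) - \eta(x') \le |\eta(x) - \hat\eta(x)| + |\hat\eta(x') - \eta(x')| \\
 &\implies |\eta(x) - \eta(x')| \le |\hat\eta(x) - \eta(x)| + |\hat\eta(x') - \eta(x')|\,.
\end{aligned}$$

The case $\eta(x) < \eta(x')$ can be proved similarly. Thus we have

$$\begin{aligned}
\mathrm{regret}_D^{\mathrm{rank}}[\hat\eta]
 &\le \frac{1}{2p(1-p)}\, \mathbf{E}_{X,X'}\Bigl[\,|\hat\eta(X) - \eta(X)| + |\hat\eta(X') - \eta(X')|\,\Bigr] \\
 &= \frac{1}{p(1-p)}\, \mathbf{E}_X\bigl[\,|\hat\eta(X) - \eta(X)|\,\bigr]\,.
\end{aligned}$$

□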
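The pointwise implication at the heart of this proof can also be spot-checked by random sampling. This is only a sanity check under illustrative names, not a substitute for the argument above: it draws values of $\eta(x)$, $\eta(x')$, $\hat\eta(x)$, $\hat\eta(x')$ uniformly from $[0,1]$ and verifies the inequality whenever the pair is mis-ordered or tied.

```python
import random

def implication_holds(eta_x, eta_xp, hat_x, hat_xp, tol=1e-12):
    """Check: mis-ordering (or a tie) implies |eta gap| <= sum of estimation errors."""
    misordered = (hat_x - hat_xp) * (eta_x - eta_xp) <= 0
    if not misordered:
        return True  # premise is false, so the implication holds vacuously
    lhs = abs(eta_x - eta_xp)
    rhs = abs(hat_x - eta_x) + abs(hat_xp - eta_xp)
    return lhs <= rhs + tol

rng = random.Random(0)  # fixed seed for reproducibility
all_ok = all(
    implication_holds(rng.random(), rng.random(), rng.random(), rng.random())
    for _ in range(100000)
)
print(all_ok)
```

A mis-ordered pair with a large gap in $\eta$ therefore forces at least one of the two estimation errors to be large, which is what lets the pairwise expectation be bounded by a single-instance $L_1$ error.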